智能论文笔记

Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units

Anurag Katakkar , Alan W Black

分类：自然语言处理 | 机器学习

2021-10-31

文本数据的语言模型（LMS）已经广泛研究了语言生成和其他下游任务的实用性。然而，纯粹在语音域中的语言建模仍然是一个相对未开发的主题，具有传统语音LMS，通常根据用于学习语言的分布方面的辅助文本LMS。对于英语语言，这些LMS将单词视为原子单位，这提出了语言域中语言建模的固有挑战。在本文中，我们提出了一种新的基于LSTM的生成语音LM，它受CBY模型的启发，并建立在包括音节和音素的语言单元上。这在数据集中的话语中提供了更好的声学一致性，而不是单个MelspectRoge框架或整个单词。使用有限的数据集，比当代生成型号规模小的数量级，我们的模型非常近似于潺潺声音。我们展示了培训与辅助文本LMS，多任务学习目标和辅助关节特征的影响。通过我们的实验，我们还强调了一些众所周知的，但在培训生成语音LMS中记录的挑战不良，包括这些模型培训的监督学习目标之间的不匹配，例如平均平方误差（MSE），以及真实目标是语音质量。我们的实验提供了早期迹象表明，验证损失和MCD）与生成的语音质量没有强烈相关，传统的文本语言建模度量，如困惑和下一个令牌预测准确性。

translated by 谷歌翻译

Atrous Space Bender U-Net (ASBU-Net/LogiNet)

Anurag Bansal , Oleg Ostap , Miguel Maestre Trueba , Kristopher Perry

分类：计算机视觉

2022-12-16

$ $With recent advances in CNNs, exceptional improvements have been made in semantic segmentation of high resolution images in terms of accuracy and latency. However, challenges still remain in detecting objects in crowded scenes, large scale variations, partial occlusion, and distortions, while still maintaining mobility and latency. We introduce a fast and efficient convolutional neural network, ASBU-Net, for semantic segmentation of high resolution images that addresses these problems and uses no novelty layers for ease of quantization and embedded hardware support. ASBU-Net is based on a new feature extraction module, atrous space bender layer (ASBL), which is efficient in terms of computation and memory. The ASB layers form a building block that is used to make ASBNet. Since this network does not use any special layers it can be easily implemented, quantized and deployed on FPGAs and other hardware with limited memory. We present experiments on resource and accuracy trade-offs and show strong performance compared to other popular models.

translated by 谷歌翻译

Urban Scene Semantic Segmentation with Low-Cost Coarse Annotation

Anurag Das , Yongqin Xian , Yang He , Zeynep Akata , Bernt Schiele

分类：计算机视觉

2022-12-15

For best performance, today's semantic segmentation methods use large and carefully labeled datasets, requiring expensive annotation budgets. In this work, we show that coarse annotation is a low-cost but highly effective alternative for training semantic segmentation models. Considering the urban scene segmentation scenario, we leverage cheap coarse annotations for real-world captured data, as well as synthetic data to train our model and show competitive performance compared with finely annotated real-world data. Specifically, we propose a coarse-to-fine self-training framework that generates pseudo labels for unlabeled regions of the coarsely annotated data, using synthetic data to improve predictions around the boundaries between semantic classes, and using cross-domain data augmentation to increase diversity. Our extensive experimental results on Cityscapes and BDD100k datasets demonstrate that our method achieves a significantly better performance vs annotation cost tradeoff, yielding a comparable performance to fully annotated data with only a small fraction of the annotation budget. Also, when used as pretraining, our framework performs better compared to the standard fully supervised setting.

translated by 谷歌翻译

Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu , Eduardo Fonseca , Radu Tudor Ionescu , Mario Lucic , Cordelia Schmid , Anurag Arnab

分类：计算机视觉

2022-12-09

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.

translated by 谷歌翻译

MVRackLay: Monocular Multi-View Layout Estimation for Warehouse Racks and Shelves

Pranjali Pathre , Anurag Sahu , Ashwin Rao , Avinash Prabhu , Meher Shashwat Nigam , Tanvi Karandikar , Harit Pandya , K. Madhava Krishna

分类：计算机视觉 | 机器人

2022-11-30

In this paper, we propose and showcase, for the first time, monocular multi-view layout estimation for warehouse racks and shelves. Unlike typical layout estimation methods, MVRackLay estimates multi-layered layouts, wherein each layer corresponds to the layout of a shelf within a rack. Given a sequence of images of a warehouse scene, a dual-headed Convolutional-LSTM architecture outputs segmented racks, the front and the top view layout of each shelf within a rack. With minimal effort, such an output is transformed into a 3D rendering of all racks, shelves and objects on the shelves, giving an accurate 3D depiction of the entire warehouse scene in terms of racks, shelves and the number of objects on each shelf. MVRackLay generalizes to a diverse set of warehouse scenes with varying number of objects on each shelf, number of shelves and in the presence of other such racks in the background. Further, MVRackLay shows superior performance vis-a-vis its single view counterpart, RackLay, in layout accuracy, quantized in terms of the mean IoU and mAP metrics. We also showcase a multi-view stitching of the 3D layouts resulting in a representation of the warehouse scene with respect to a global reference frame akin to a rendering of the scene from a SLAM pipeline. To the best of our knowledge, this is the first such work to portray a 3D rendering of a warehouse scene in terms of its semantic components - Racks, Shelves and Objects - all from a single monocular camera.

translated by 谷歌翻译

Learnings from Technological Interventions in a Low Resource Language: Enhancing Information Access in Gondi

Devansh Mehta , Harshita Diddee , Ananya Saxena , Anurag Shukla , Sebastin Santy , Ramaravind Kommiya Mothilal , Brij Mohan Lal Srivastava , Alok Sharma , Vishnu Prasad , Venkanna U

分类：自然语言处理

2022-11-29

The primary obstacle to developing technologies for low-resource languages is the lack of representative, usable data. In this paper, we report the deployment of technology-driven data collection methods for creating a corpus of more than 60,000 translations from Hindi to Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. During this process, we help expand information access in Gondi across 2 different dimensions (a) The creation of linguistic resources that can be used by the community, such as a dictionary, children's stories, Gondi translations from multiple sources and an Interactive Voice Response (IVR) based mass awareness platform; (b) Enabling its use in the digital domain by developing a Hindi-Gondi machine translation model, which is compressed by nearly 4 times to enable it's edge deployment on low-resource edge devices and in areas of little to no internet connectivity. We also present preliminary evaluations of utilizing the developed machine translation model to provide assistance to volunteers who are involved in collecting more data for the target language. Through these interventions, we not only created a refined and evaluated corpus of 26,240 Hindi-Gondi translations that was used for building the translation model but also engaged nearly 850 community members who can help take Gondi onto the internet.

translated by 谷歌翻译

Is Conditional Generative Modeling all you need for Decision-Making?

Anurag Ajay , Yilun Du , Abhi Gupta , Joshua Tenenbaum , Tommi Jaakkola , Pulkit Agrawal

分类：机器学习 | 人工智能

2022-11-28

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.

translated by 谷歌翻译

RecXplainer: Post-Hoc Attribute-Based Explanations for Recommender Systems

Sahil Verma , Anurag Beniwal , Narayanan Sadagopan , Arjun Seshadri

分类：人工智能 | 机器学习

2022-11-27

Recommender systems are ubiquitous in most of our interactions in the current digital world. Whether shopping for clothes, scrolling YouTube for exciting videos, or searching for restaurants in a new city, the recommender systems at the back-end power these services. Most large-scale recommender systems are huge models trained on extensive datasets and are black-boxes to both their developers and end-users. Prior research has shown that providing recommendations along with their reason enhances trust, scrutability, and persuasiveness of the recommender systems. Recent literature in explainability has been inundated with works proposing several algorithms to this end. Most of these works provide item-style explanations, i.e., `We recommend item A because you bought item B.' We propose a novel approach, RecXplainer, to generate more fine-grained explanations based on the user's preference over the attributes of the recommended items. We perform experiments using real-world datasets and demonstrate the efficacy of RecXplainer in capturing users' preferences and using them to explain recommendations. We also propose ten new evaluation metrics and compare RecXplainer to six baseline methods.

translated by 谷歌翻译

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

Kuan-Lin Chen , Daniel D. E. Wong , Ke Tan , Buye Xu , Anurag Kumar , Vamsi Krishna Ithapu

分类：机器学习

2022-11-16

Most speech enhancement (SE) models learn a point estimate, and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular losses including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).

translated by 谷歌翻译

Learning-Based Radiomic Prediction of Type 2 Diabetes Mellitus Using Image-Derived Phenotypes

Michael S. Yao , Allison Chae , Matthew T. MacLean , Anurag Verma , Jeffrey Duda , James Gee , Drew A. Torigian , Daniel Rader , Charles Kahn , Walter R. Witschey

分类：机器学习 | 人工智能

2022-09-20

2型糖尿病（T2DM）的早期诊断对于及时的治疗干预措施和生活方式改变至关重要。随着医学成像数据在许多患者群体中变得更广泛可用，我们试图研究是否可以在表格学习分类器模型中利用图像衍生的表型数据来预测T2DM的发病率，而无需使用侵入性血液实验室测量。我们表明，使用图像衍生表型的神经网络和决策树模型都可以预测患者T2DM状态的召回评分高达87.6％。我们还提出了与“ Syntha1c编码器”相同的结构的新颖使用，这些结构能够输出模仿血液血红蛋白A1C经验实验室测量值的可解释值。最后，我们证明了T2DM风险预测模型对输入矢量成分中小扰动的敏感性可用于预测从以前看不见的患者人群中取样的协变量的性能。

translated by 谷歌翻译